Feature Markov Decision Processes 



Marcus Hutter 






RSISE @ ANU and SML @ NICTA 
Canberra, ACT, 0200, Australia 
marcus@hutter1.net www.hutter1.net

24 December 2008 



Abstract 

General purpose intelligent learning agents cycle 
through (complex, non-MDP) sequences of obser- 
vations, actions, and rewards. On the other hand, 
reinforcement learning is well-developed for small 
finite state Markov Decision Processes (MDPs). 
So far it is an art performed by human design- 
ers to extract the right state representation out 
of the bare observations, i.e. to reduce the agent 
setup to the MDP framework. Before we can think 
of mechanizing this search for suitable MDPs, we 
need a formal objective criterion. The main contri- 
bution of this article is to develop such a criterion. 
I also integrate the various parts into one learning 
algorithm. Extensions to more realistic dynamic 
Bayesian networks are developed in the companion
article [Hut09].

Keywords: evolutionary algorithms, ranking se- 
lection, tournament selection, equivalence, effi- 
ciency. 

1 Introduction 

Background & motivation. Artificial General Intelligence
(AGI) is concerned with designing agents that perform
well in a wide range of environments [GP07, LH07].
Among the well-established "narrow" AI approaches, ar- 
guably Reinforcement Learning (RL) pursues most di- 
rectly the same goal. RL considers the general agent- 
environment setup in which an agent interacts with an 
environment (acts and observes in cycles) and receives 
(occasional) rewards. The agent's objective is to collect 
as much reward as possible. Most if not all AI problems 
can be formulated in this framework. 

The simplest interesting environmental class consists of 
finite state fully observable Markov Decision Processes 
(MDPs) [Put94, SB98], which is reasonably well understood.
Extensions to continuous states with (non)linear
function approximation [SB98, Gor99], partial observability
(POMDP) [KLC98, RPPCd08], structured MDPs
(DBNs) [SDL07], and others have been considered, but
the algorithms are much more brittle. 

In any case, a lot of work is still left to the designer,
namely to extract the right state representation ("features")
out of the bare observations. Even if potentially
useful representations have been found, it is usually not 
clear which one will turn out to be better, except in situ- 
ations where we already know a perfect model. Think of 
a mobile robot equipped with a camera plunged into an 
unknown environment. While we can imagine which im- 
age features are potentially useful, we cannot know which 
ones will actually be useful. 

Main contribution. Before we can think of mechanically 
searching for the "best" MDP representation, we need a 
formal objective criterion. Obviously, at any point in time, 
if we want the criterion to be effective it can only depend 
on the agent's past experience. The main contribution of
this article is to develop such a criterion. Reality is a 
non-ergodic partially observable uncertain unknown envi- 
ronment in which acquiring experience can be expensive. 
So we want/need to exploit the data (past experience) at
hand optimally, cannot generate virtual samples since the
model is not given (it needs to be learned itself), and there is
no reset-option. In regression and classification, penalized
maximum likelihood criteria [HTF01, Chp. 7] have successfully
been used for semi-parametric model selection. It is
far from obvious how to apply them in RL. Ultimately we 
do not care about the observations but the rewards. The 
rewards depend on the states, but the states are arbitrary 
in the sense that they are model-dependent functions of 
the data. Indeed, our derived Cost function cannot be 
interpreted as a usual model+data code length. 

Relation to other work. As partly detailed later, the
suggested ΦMDP model could be regarded as a scaled-down
practical instantiation of AIXI [Hut05, Hut07], as a
way to side-step the open problem of learning POMDPs,
as extending the idea of state-aggregation from planning
(based on bi-simulation metrics [GDG03]) to RL (based on
code length), as generalizing U-Tree [McC96] to arbitrary
features, or as an alternative to PSRs [SLJ+03] for which
proper learning algorithms have yet to be developed.

Notation. Throughout this article, log denotes the binary
logarithm, $\epsilon$ the empty string, and $\delta_{x,y} = \delta_{xy} = 1$ if
$x = y$ and 0 else is the Kronecker symbol. I generally omit
separating commas if no confusion arises, in particular in
indices. For any $x$ of suitable type (string, vector, set), I
define string $x = x_{1:l} = x_1...x_l$, sum $x_+ = \sum_j x_j$, union
$x_* = \bigcup_j x_j$, and vector $x_\bullet = (x_1,...,x_l)$, where $j$ ranges over
the full range $\{1,...,l\}$ and $l = |x|$ is the length or dimension
or size of $x$. $\hat x$ denotes an estimate of $x$. $P(\cdot)$ denotes a
probability over states and rewards or parts thereof. I do
not distinguish between random variables $X$ and realizations
$x$, and abbreviation $P(x) := P[X = x]$ never leads to
confusion. More specifically, $m \in \mathbb{N}$ denotes the number
of states, $i \in \{1,...,m\}$ any state index, $n \in \mathbb{N}$ the current
time, and $t \in \{1,...,n\}$ any time. Further, in order
not to get distracted, at several places I gloss over initial
conditions or special cases where inessential. Also
$0\cdot\text{undefined} = 0\cdot\infty := 0$.

2 Feature Markov Decision Process (ΦMDP)

This section describes our formal setup. It consists
of the agent-environment framework and maps $\Phi$ from
observation-action-reward histories to MDP states. I call
this arrangement "Feature MDP" or short ΦMDP.

Agent-environment setup. I consider the standard
agent-environment setup [RN03] in which an Agent interacts
with an Environment. The agent can choose from
actions $a \in \mathcal{A}$ (e.g. limb movements) and the environment
provides (regular) observations $o \in \mathcal{O}$ (e.g. camera images)
and real-valued rewards $r \in \mathcal{R} \subset \mathbb{R}$ to the agent. The reward
may be very scarce, e.g. just +1 (-1) for winning
(losing) a chess game, and 0 at all other times [Hut05,
Sec. 6.3]. This happens in cycles $t = 1,2,3,...$: At time $t$,
after observing $o_t$, the agent takes action $a_t$ based on history
$h_t := o_1 a_1 r_1 ... o_{t-1} a_{t-1} r_{t-1} o_t$. Thereafter, the agent
receives reward $r_t$. Then the next cycle $t+1$ starts. The
agent's objective is to maximize his long-term reward.
Without much loss of generality, I assume that $\mathcal{A}$, $\mathcal{O}$, and
$\mathcal{R}$ are finite. Implicitly I assume $\mathcal{A}$ to be small, while $\mathcal{O}$
may be huge.

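The agent and environment may be viewed as a pair or
triple of interlocking functions of the history
$\mathcal{H} := (\mathcal{O}\times\mathcal{A}\times\mathcal{R})^* \times \mathcal{O}$:

$$\begin{aligned}
\mathrm{Env} &: \mathcal{H}\times\mathcal{A}\times\mathcal{R} \leadsto \mathcal{O}, & o_n &= \mathrm{Env}(h_{n-1} a_{n-1} r_{n-1}),\\
\mathrm{Agent} &: \mathcal{H} \leadsto \mathcal{A}, & a_n &= \mathrm{Agent}(h_n),\\
\mathrm{Env} &: \mathcal{H}\times\mathcal{A} \leadsto \mathcal{R}, & r_n &= \mathrm{Env}(h_n a_n),
\end{aligned}$$

where $\leadsto$ indicates that mappings $\to$ might be stochastic.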

The goal of AI is to design agents that achieve high 
(expected) reward over the agent's lifetime. 

(Un)known environments. For known Env(), finding
the reward maximizing agent is a well-defined and formally
solvable problem [Hut05, Chp. 4], with computational
efficiency being the "only" matter of concern. For most
real-world AI problems Env() is at best partially known.

Narrow AI considers the case where function Env() is 
either known (like in blocks world), or essentially known
(like in chess, where one can safely model the opponent as
a perfect minimax player), or Env() belongs to a relatively 
small class of environments (e.g. traffic control). 

The goal of AGI is to design agents that perform well in
a large range of environments [LH07], i.e. achieve high
reward over their lifetime with as few assumptions as
possible about Env(). A minimal necessary assumption is
that the environment possesses some structure or pattern.

From real-life experience (and from the examples below) 
we know that usually we do not need to know the complete 
history of events in order to determine (sufficiently well) 
what will happen next and to be able to perform well. Let 
$\Phi(h)$ be such a "useful" summary of history $h$.

Examples. In full-information games (like chess) with
static opponent, it is sufficient to know the current state
of the game (board configuration) to play well (the history
plays no role), hence $\Phi(h_t) = o_t$ is a sufficient summary
(Markov condition). Classical physics is essentially
predictable from position and velocity of objects at a single
time, or equivalently from the locations at two consecutive
times, hence $\Phi(h_t) = o_t o_{t-1}$ is a sufficient summary (2nd
order Markov). For i.i.d. processes of unknown probability
(e.g. clinical trials $\simeq$ Bandits), the frequency of observations
$\Phi(h_n) = (\sum_{t=1}^n \delta_{o_t o})_{o\in\mathcal{O}}$ is a sufficient statistic. In
a POMDP planning problem, the so-called belief vector
at time $t$ can be written down explicitly as some function
of the complete history $h_t$ (by integrating out the
hidden states). $\Phi(h_t)$ could be chosen as (a discretized
version of) this belief vector, showing that ΦMDP generalizes
POMDPs. Obviously, the identity $\Phi(h) = h$ is always
sufficient but not very useful, since Env() as a function of
$\mathcal{H}$ is hard to impossible to "learn".

This suggests looking for $\Phi$ with small codomain, which
allows one to learn/estimate/approximate Env by $\widehat{\mathrm{Env}}$ such
that $o_t \approx \widehat{\mathrm{Env}}(\Phi(h_{t-1}))$ for $t = 1...n$.

Example. Consider a robot equipped with a camera, i.e.
$o$ is a pixel image. Computer vision algorithms usually
extract a set of features from $o_{t-1}$ (or $h_{t-1}$), from low-level
patterns to high-level objects with their spatial relation.
Neither is it possible nor necessary to make a precise
prediction of $o_t$ from summary $\Phi(h_{t-1})$. An approximate
prediction must and will do. The difficulty is that the
similarity measure "$\approx$" needs to be context dependent.
Minor image nuances are irrelevant when driving a car,
but when buying a painting it makes a huge difference in
price whether it's an original or a copy. Essentially only a
bijection $\Phi$ would be able to extract all potentially
interesting features, but such a $\Phi$ defeats its original purpose.

From histories to states. It is of utmost importance
to properly formalize the meaning of "$\approx$" in a general,
domain-independent way. Let $s_t := \Phi(h_t)$ summarize all
relevant information in history $h_t$. I call $s$ a state or
feature (vector) of $h$. "Relevant" means that the future
is predictable from $s_t$ (and $a_t$) alone, and that the relevant
future is coded in $s_{t+1} s_{t+2}...$. So we pass from the
complete (and known) history $o_1 a_1 r_1 ... o_n a_n r_n$ to a
"compressed" history $sar_{1:n} \equiv s_1 a_1 r_1 ... s_n a_n r_n$ and seek $\Phi$ such
that $s_{t+1}$ is (approximately a stochastic) function of $s_t$
(and $a_t$). Since the goal of the agent is to maximize his
rewards, the rewards $r_t$ are always relevant, so they (have
to) stay untouched (this will become clearer below).

The ΦMDP. The structure derived above is a classical
Markov Decision Process (MDP), but the primary question
I ask is not the usual one of finding the value function
or best action or comparing different models of a given
state sequence. I ask how well the state-action-reward
sequence generated by $\Phi$ can be modeled as an MDP compared
to other sequences resulting from different $\Phi$.

3 ΦMDP Coding and Evaluation

I first review optimal codes and model selection methods
for i.i.d. sequences, subsequently adapt them to our situation,
and show that they are suitable in our context. I
state my Cost function for $\Phi$ and the $\Phi$ selection principle.

I.i.d. processes. Consider i.i.d. $x_1...x_n \in \mathcal{X}^n$ for finite
$\mathcal{X} = \{1,...,m\}$. For known $\theta_i = P[x_t = i]$ we have
$P(x_{1:n}|\theta) = \theta_{x_1}\cdot...\cdot\theta_{x_n}$. It is well-known that there exists a
code (e.g. arithmetic or Shannon-Fano) for $x_{1:n}$ of length
$-\log P(x_{1:n}|\theta)$, which is asymptotically optimal with
probability one.

For unknown $\theta$ we may use a frequency estimate $\hat\theta_i = n_i/n$,
where $n_i = |\{t : x_t = i\}|$. Then $-\log P(x_{1:n}|\hat\theta) = n H(\hat\theta)$,
where $H(\hat\theta) := -\sum_{i=1}^m \hat\theta_i \log\hat\theta_i$ is the Entropy of
$\hat\theta$ ($0\log 0 := 0 =: 0\log\frac{0}{0}$). We also need to code $(n_i)$, which
naively needs $\log n$ bits for each $i$. One can show that it
is sufficient to code each $\hat\theta_i$ to accuracy $O(1/\sqrt{n})$, which
requires only $\frac12\log n + O(1)$ bits each. Hence the overall
code length of $x_{1:n}$ for unknown frequencies is

$$\mathrm{CL}(x_{1:n}) = \mathrm{CL}(\boldsymbol{n}) := n\,H(\boldsymbol{n}/n) + \frac{m'-1}{2}\log n \tag{1}$$

for $n > 0$ and 0 else, where $\boldsymbol{n} = (n_1,...,n_m)$ and $n = n_+ =
n_1 + ... + n_m$ and $m' = |\{i : n_i > 0\}| \leq m$ is the number of
non-empty categories. The code is optimal (within $+O(1)$) for
all i.i.d. sources. It can be rigorously derived from many
principles: MDL, MML, combinatorial, incremental, and
Bayesian [Grü07].
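
As a minimal sketch, the code length (1) can be evaluated directly from the category counts. The function names `code_length` and `code_length_of_sequence` are assumptions of this sketch and do not appear in the article; the O(1) terms are ignored.

```python
import math
from collections import Counter


def code_length(counts):
    """CL(n) of Eq. (1): n*H(n/n) + (m'-1)/2 * log2(n), and 0 for n = 0."""
    counts = [c for c in counts if c > 0]   # m' counts only non-empty categories
    n = sum(counts)
    if n == 0:
        return 0.0
    # n * H(n/n) = -sum_i n_i * log2(n_i / n), measured in bits
    entropy_bits = -sum(c * math.log2(c / n) for c in counts)
    return entropy_bits + 0.5 * (len(counts) - 1) * math.log2(n)


def code_length_of_sequence(xs):
    """Code length of a sequence, treating its elements as i.i.d. symbols."""
    return code_length(Counter(xs).values())
```

For instance, `code_length([4, 4])` consists of 8 bits for the data plus (1/2) log 8 = 1.5 bits for the single free frequency, while a sequence with only one occurring symbol gets code length zero.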

In the following I will ignore the $O(1)$ terms and refer
to (1) simply as the code length. Note that $x_{1:n}$ is coded
exactly (lossless). Similarly (see MDP below) sampling
models more complex than i.i.d. may be considered, and
the one that leads to the shortest code is selected as the
best model [Grü07].

MDP definitions. Recall that a sequence $sar_{1:n}$ is said
to be sampled from an MDP $(\mathcal{S},\mathcal{A},T,R)$ iff the probability
of $s_t$ only depends on $s_{t-1}$ and $a_{t-1}$; and $r_t$ only on $s_{t-1}$,
$a_{t-1}$, and $s_t$. That is,

$$\begin{aligned}
P(s_t|h_{t-1}a_{t-1}) &= P(s_t|s_{t-1},a_{t-1}) =: T_{s_{t-1}s_t}^{a_{t-1}}\\
P(r_t|h_t) &= P(r_t|s_{t-1},a_{t-1},s_t) =: R_{s_{t-1}s_t}^{a_{t-1}r_t}
\end{aligned}$$

For simplicity of exposition I assume a deterministic
dependence of $r_t$ on $s_t$ only, i.e. $r_t = R_{s_t}$. In our case, we can
identify the state-space $\mathcal{S}$ with the states $s_1,...,s_n$
"observed" so far. Hence $\mathcal{S} = \{s^1,...,s^m\}$ is finite and typically
$m < n$, i.e. states repeat. Let $s \xrightarrow{a} s'(r')$ be shorthand for
"action $a$ in state $s$ resulted in state $s'$ (reward $r'$)". Let
$\mathcal{T}_{ss'}^{ar'} := \{t \leq n : s_{t-1} = s, a_{t-1} = a, s_t = s', r_t = r'\}$ be the set
of times $t-1$ at which $s \xrightarrow{a} s'r'$, and $n_{ss'}^{ar'} := |\mathcal{T}_{ss'}^{ar'}|$ their
number ($n_{++}^{++} = n$).

Coding MDP sequences. For some fixed $s$ and $a$, consider
the subsequence $s_{t_1}...s_{t_{n'}}$ of states reached from $s$
via $a$ ($s \xrightarrow{a} s_{t_i}$), i.e. $\{t_1,...,t_{n'}\} = \mathcal{T}_{s*}^{a*}$, where $n' = n_{s+}^{a+}$. By
definition of an MDP, this sequence is i.i.d. with $s'$ occurring
$n'_{s'} := n_{ss'}^{a+}$ times. By (1) we can code this sequence in
$\mathrm{CL}(\boldsymbol{n}')$ bits. The whole sequence $s_{1:n}$ consists of $|\mathcal{S}\times\mathcal{A}|$
i.i.d. sequences, one for each $(s,a) \in \mathcal{S}\times\mathcal{A}$. We can join
their codes and get a total code length

$$\mathrm{CL}(s_{1:n}|a_{1:n}) = \sum_{s,a} \mathrm{CL}(\boldsymbol{n}_{s\bullet}^{a+}) \tag{2}$$

Similarly to the states we code the rewards. There are
different "standard" reward models. I consider only the
simplest case of a small discrete reward set $\mathcal{R}$ like $\{0,1\}$
or $\{-1,0,+1\}$ here and defer generalizations to $\mathbb{R}$ and a
discussion of variants to the ΦDBN model [Hut09]. By the
MDP assumption, for each state $s'$, the rewards at times
$t$ with $s_t = s'$ are i.i.d. Hence they can be coded in

$$\mathrm{CL}(r_{1:n}|s_{1:n},a_{1:n}) = \sum_{s'} \mathrm{CL}(\boldsymbol{n}_{+s'}^{+\bullet}) \tag{3}$$

bits. I have been careful to assign zero code length to
non-occurring transitions $s \xrightarrow{a} s'r'$ so that large but sparse
MDPs don't get penalized too much.

Reward$\leftrightarrow$state trade-off. Note that the code for $r$ depends
on $s$. Indeed we may interpret the construction as
follows: Ultimately we/the agent cares about the reward,
so we want to measure how well we can predict the rewards,
which we do with (3). But this code depends on $s$,
so we need a code for $s$ too, which is (2). To see that we
need both parts consider two extremes.

A simplistic state transition model (small $|\mathcal{S}|$) results in
a short code for $s$. For instance, for $|\mathcal{S}| = 1$, nothing needs
to be coded and (2) is identically zero. But this obscures
potential structure in the reward sequence, leading to a
long code for $r$.

On the other hand, the more detailed the state transition
model (large $|\mathcal{S}|$) the easier it is to predict and hence
compress $r$. But a large model is hard to learn, i.e. the
code for $s$ will be large. For instance for $\Phi(h) = h$, no state
repeats and the frequency-based coding breaks down.

$\Phi$ selection principle. Let us define the Cost of $\Phi : \mathcal{H} \to \mathcal{S}$
on $h_n$ as the length of the ΦMDP code for $sr$ given $a$:

$$\mathrm{Cost}(\Phi|h_n) := \mathrm{CL}(s_{1:n}|a_{1:n}) + \mathrm{CL}(r_{1:n}|s_{1:n},a_{1:n}), \tag{4}$$

where $s_t = \Phi(h_t)$ and $h_t = oar_{1:t-1} o_t$.
The discussion above suggests that the minimum of the
joint code length, i.e. the Cost, is attained for a $\Phi$ that
keeps all and only relevant information for predicting rewards.
Such a $\Phi$ may be regarded as best explaining the
rewards. So we are looking for a $\Phi$ of minimal cost:

$$\Phi^{best} := \arg\min_\Phi \{\mathrm{Cost}(\Phi|h_n)\} \tag{5}$$

The state sequence generated by $\Phi^{best}$ (or approximations
thereof) will usually only be approximately MDP.
While $\mathrm{Cost}(\Phi|h)$ is an optimal code only for MDP sequences,
it still yields good codes for approximate MDP
sequences. Indeed, $\Phi^{best}$ balances closeness to MDP with
simplicity. The primary purpose of the simplicity bias is
not computational tractability, but generalization ability
[LH07, Hut05].

4 A Tiny Example 

The purpose of the tiny example in this section is to provide
enough insight into how and why ΦMDP works to
convince the reader that our $\Phi$ selection principle is reasonable.
Consider binary observation space $\mathcal{O} = \{0,1\}$,
quaternary reward space $\mathcal{R} = \{0,1,2,3\}$, and a single action
$\mathcal{A} = \{0\}$. Observations $o_t$ are independent fair coin
flips, i.e. Bernoulli($\frac12$), and reward $r_t = 2o_{t-1} + o_t$ is a
deterministic function of the two most recent observations.

Considered features. As features $\Phi$ I consider $\Phi_k : \mathcal{H} \to
\mathcal{O}^k$ with $\Phi_k(h_t) = o_{t-k+1}...o_t$ for various $k = 0,1,2,...$ which
regard the last $k$ observations as "relevant". Intuitively $\Phi_2$
is the best observation summary, which I confirm below.
The state space is $\mathcal{S} = \{0,1\}^k$ (for sufficiently large $n$). The
ΦMDPs for $k = 0,1,2$ are as follows.



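[Figure: state-transition diagrams of $\Phi_0$MDP (left), $\Phi_1$MDP (middle), and $\Phi_2$MDP (right). Reward labels: r = 0|1|2|3 in the single state of $\Phi_0$MDP; r = 0|2 and r = 1|3 in the two states of $\Phi_1$MDP.]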





$\Phi_2$MDP with all non-zero transition probabilities being
50% is an exact representation of our data source.
The missing arrows (directions) are due to the fact that
$s = o_{t-1}o_t$ can only lead to $s' = o'_t o'_{t+1}$ for which $o'_t = o_t$.
Note that ΦMDP does not "know" this and has to learn
the (non)zero transition probabilities. Each state has two
successor states with equal probability, hence generates
(see previous paragraph) a Bernoulli($\frac12$) state subsequence
and a constant reward sequence, since the reward can be
computed from the state = last two observations. Asymptotically,
all four states occur equally often, hence the sequences
have approximately the same length $n/4$.

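In general, if $s$ (and similarly $r$) consists of $x \in \mathbb{N}$ i.i.d.
subsequences of equal length $n/x$ over $y \in \mathbb{N}$ symbols, the
code length (2) (and similarly (3)) is

$$\begin{aligned}
\mathrm{CL}(s|a;x_y) &= n\log y + x\,\frac{|\mathcal{S}|-1}{2}\log\frac{n}{x},\\
\mathrm{CL}(r|s,a;x_y) &= n\log y + x\,\frac{|\mathcal{R}|-1}{2}\log\frac{n}{x},
\end{aligned}$$

where the extra argument $x_y$ just indicates the sequence
property. So for $\Phi_2$MDP we get

$$\mathrm{CL}(s|a;4_2) = n + 6\log\frac{n}{4} \quad\text{and}\quad \mathrm{CL}(r|s,a;4_1) = 6\log\frac{n}{4}$$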

The log-terms reflect the required memory to code (or 
the time to learn) the MDP structure and probabilities. 
Since each state has only 2 realized/possible successors, 
we need n bits to code the state sequence. The reward 
is a deterministic function of the state, hence needs no 
memory to code given s. 

The $\Phi_0$MDP throws away all observations (left figure
above), hence $\mathrm{CL}(s|a;1_1) = 0$. While the reward sequence
is not i.i.d. (e.g. $r_{t+1} = 3$ cannot follow $r_t = 0$),
$\Phi_0$MDP has no choice but to regard them as i.i.d., resulting
in $\mathrm{CL}(r|s,a;1_4) = 2n + \frac32\log n$.

The $\Phi_1$MDP model is an interesting compromise (middle
figure above). The state allows a partial prediction
of the reward: State 0 allows rewards 0 and 2; state 1
allows rewards 1 and 3. Each of the two states creates
a Bernoulli($\frac12$) state successor subsequence and a binary
reward sequence, wrongly presumed to be Bernoulli($\frac12$).
Hence $\mathrm{CL}(s|a;2_2) = n + \log\frac{n}{2}$ and $\mathrm{CL}(r|s,a;2_2) = n + 3\log\frac{n}{2}$.

Summary. The following table summarizes the results
for general $k = 0,1,2$ and beyond:

    Cost($\Phi_0|h$)            $2n + \frac32\log n$
    Cost($\Phi_1|h$)            $2n + 4\log\frac{n}{2}$
    Cost($\Phi_2|h$)            $n + 12\log\frac{n}{4}$
    Cost($\Phi_{k\geq2}|h$)     $n + \frac{2^k+2}{2^{1-k}}\log\frac{n}{2^k}$



For large $n$, $\Phi_2$ results in the shortest code, as anticipated.
The "approximate" model $\Phi_1$ is just not good enough to
beat the vacuous model $\Phi_0$, but in more realistic examples
some approximate model usually has the shortest code. In
[Hut09] I show on a more complex example how $\Phi^{best}$ will
store long-term information in a POMDP environment.
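
The selection among the $\Phi_k$ can also be checked empirically. The following is a minimal simulation sketch, not the article's implementation: it draws fair coin flips, forms the states of $\Phi_k$, and evaluates the Cost (4) from the counts via (1)-(3). The names `code_length`, `cost`, `phi_k`, and `simulate`, the fixed seed, and the convention $o_0 := 0$ for the very first reward are assumptions of this sketch. Since it uses the count $m'$ of non-empty categories from (1), the constants in front of its log terms need not match the table.

```python
import math
import random
from collections import Counter, defaultdict


def code_length(counts):
    """CL(n) of Eq. (1) with m' = number of non-empty categories."""
    counts = [c for c in counts if c > 0]
    n = sum(counts)
    if n == 0:
        return 0.0
    return (-sum(c * math.log2(c / n) for c in counts)
            + 0.5 * (len(counts) - 1) * math.log2(n))


def cost(states, actions, rewards):
    """Cost of Eq. (4) for aligned sequences s_t, a_t, r_t."""
    successors = defaultdict(Counter)      # (s, a) -> counts of s', Eq. (2)
    for t in range(1, len(states)):
        successors[(states[t - 1], actions[t - 1])][states[t]] += 1
    reward_counts = defaultdict(Counter)   # s' -> counts of r', Eq. (3)
    for s, r in zip(states, rewards):
        reward_counts[s][r] += 1
    cl_states = sum(code_length(c.values()) for c in successors.values())
    cl_rewards = sum(code_length(c.values()) for c in reward_counts.values())
    return cl_states + cl_rewards


def phi_k(observations, k):
    """States of Phi_k: the last k observations (fewer at the very start)."""
    return [tuple(observations[max(0, t - k + 1): t + 1])
            for t in range(len(observations))]


def simulate(n, seed=0):
    """Fair coin observations and rewards r_t = 2*o_{t-1} + o_t (o_0 := 0)."""
    rng = random.Random(seed)
    obs = [rng.randint(0, 1) for _ in range(n)]
    rewards = [2 * (obs[t - 1] if t else 0) + obs[t] for t in range(n)]
    return obs, rewards


obs, rewards = simulate(2000)
actions = [0] * len(obs)                   # single action A = {0}
costs = {k: cost(phi_k(obs, k), actions, rewards) for k in range(4)}
best_k = min(costs, key=costs.get)         # arg min of Eq. (5) over k = 0..3
```

In this sketch the cost of $\Phi_2$ stays close to $n$ bits while $\Phi_0$ and $\Phi_1$ need roughly $2n$ bits, and $\Phi_3$ pays a larger log penalty than $\Phi_2$ without gaining predictive power.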

5 Cost(Φ) Minimization

I have reduced the reinforcement learning problem to a
formal $\Phi$-optimization problem. I briefly explain what
we have gained by this reduction, and provide some general
information about problem representations, stochastic
search, and $\Phi$ neighborhoods. Finally I present a simplistic
but concrete algorithm for searching context tree
MDPs.

$\Phi$ search. I now discuss how to find good summaries $\Phi$.
The introduced generic cost function $\mathrm{Cost}(\Phi|h_n)$, based
on only the known history $h_n$, makes this a well-defined
task that is completely decoupled from the complex (ill-defined)
reinforcement learning objective. This reduction
should not be under-estimated. We can employ a wide
range of optimizers and do not even have to worry about
overfitting. The most challenging task is to come up with
creative algorithms proposing $\Phi$'s.

There are many optimization methods. Most of them
are search-based: random, blind, informed, adaptive, local,
global, population based, exhaustive, heuristic, and
other search methods [AL97]. Most are or can be adapted
to the structure of the objective function, here $\mathrm{Cost}(\cdot|h_n)$.
Some exploit the structure more directly (e.g. gradient
methods for convex functions). Only in very simple cases
can the minimum be found analytically (without search).

General maps $\Phi$ can be represented by/as programs for
which variants of Levin search [Sch04, Hut05] and genetic
programming are the major search algorithms. Decision
trees/lists/grids are also quite powerful, especially rule-based
ones in which logical expressions recursively divide
domain $\mathcal{H}$ into "true/false" regions [San08] that can be
identified with different states.

$\Phi$ neighborhood relation. Most search algorithms require
the specification of a neighborhood relation or distance
between candidate $\Phi$. A natural "minimal" change
of $\Phi$ is splitting and merging states (state refinement and
coarsening). Let $\Phi'$ split some state $s^a \in \mathcal{S}$ of $\Phi$ into
$s^b, s^c \notin \mathcal{S}$:

$$\Phi'(h) := \begin{cases} \Phi(h) & \text{if } \Phi(h) \neq s^a\\ s^b \text{ or } s^c & \text{if } \Phi(h) = s^a \end{cases}$$

where the histories in state $s^a$ are distributed among $s^b$
and $s^c$ according to some splitting rule (e.g. randomly).
The new state space is $\mathcal{S}' = \mathcal{S}\setminus\{s^a\}\cup\{s^b,s^c\}$. Similarly $\Phi'$
merges states $s^b, s^c \in \mathcal{S}$ into $s^a \notin \mathcal{S}$ if

$$\Phi'(h) := \begin{cases} \Phi(h) & \text{if } \Phi(h) \notin \{s^b,s^c\}\\ s^a & \text{if } \Phi(h) = s^b \text{ or } s^c \end{cases}$$

where $\mathcal{S}' = \mathcal{S}\setminus\{s^b,s^c\}\cup\{s^a\}$. We can regard $\Phi'$ as being a
neighbor of or similar to $\Phi$.

Stochastic $\Phi$ search. Stochastic search is the method
of choice for high-dimensional unstructured problems.
Monte Carlo methods can actually be highly effective, despite
their simplicity [Liu02]. The general idea is to randomly
choose a neighbor $\Phi'$ of $\Phi$ and replace $\Phi$ by $\Phi'$ if
it is better, i.e. has smaller Cost. Even if $\mathrm{Cost}(\Phi'|h) >
\mathrm{Cost}(\Phi|h)$ we may keep $\Phi'$, but only with some (in the
cost difference exponentially) small probability. Simulated
annealing is a version which minimizes $\mathrm{Cost}(\Phi|h)$. Apparently,
$\Phi$ of small cost are (much) more likely to occur than
high cost $\Phi$.

Context tree example. The $\Phi_k$ in Section 4 depended
on the last $k$ observations. Let us generalize this to a context
dependent variable length: Consider a finite complete
suffix free set of strings (= prefix tree of reversed strings)
$\mathcal{S} \subset \mathcal{O}^*$ as our state space (e.g. $\mathcal{S} = \{0,01,011,111\}$ for binary
$\mathcal{O}$), and define $\Phi_{\mathcal{S}}(h_n) := s$ iff $o_{n-|s|+1:n} = s \in \mathcal{S}$, i.e.
$s$ is the part of the history regarded as relevant. State
splitting and merging works as follows: For binary $\mathcal{O}$, if
history part $s \in \mathcal{S}$ of $h_n$ is deemed too short, we replace
$s$ by $0s$ and $1s$ in $\mathcal{S}$, i.e. $\mathcal{S}' = \mathcal{S}\setminus\{s\}\cup\{0s,1s\}$. If histories
$1s, 0s \in \mathcal{S}$ are deemed too long, we replace them by $s$,
i.e. $\mathcal{S}' = \mathcal{S}\setminus\{0s,1s\}\cup\{s\}$. Large $\mathcal{O}$ might be coded binary
and then treated similarly. The idea of using suffix trees
as state space is from [McC96]. For small $\mathcal{O}$ we have the
following simple $\Phi$-optimizer:



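ΦImprove($\Phi_{\mathcal{S}}$, $h_n$)
⌈ Randomly choose a state $s \in \mathcal{S}$;
│ Let $p$ and $q$ be uniform random numbers in $[0,1]$;
│ if ($p > 1/2$) then split $s$, i.e. $\mathcal{S}' = \mathcal{S}\setminus\{s\}\cup\{os : o \in \mathcal{O}\}$
│ else if $\{os : o \in \mathcal{O}\} \subseteq \mathcal{S}$
│   then merge them, i.e. $\mathcal{S}' = \mathcal{S}\setminus\{os : o \in \mathcal{O}\}\cup\{s\}$;
│ if ($\mathrm{Cost}(\Phi_{\mathcal{S}}|h_n) - \mathrm{Cost}(\Phi_{\mathcal{S}'}|h_n) > \log(q)$) then $\mathcal{S} := \mathcal{S}'$;
⌊ return ($\Phi_{\mathcal{S}}$);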
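
One possible rendering of a single split/merge step on a suffix set is sketched below. It is an illustration under stated assumptions, not the article's code: contexts are tuples of observations (oldest first), `cost` is any caller-supplied function from a suffix set to a code length such as (4), and for merging the sketch takes the randomly chosen state together with its siblings and replaces them by their common parent suffix, which is one way to read the merge step. The names `phi_suffix` and `phi_improve` are hypothetical.

```python
import math
import random


def phi_suffix(suffixes, observations):
    """Phi_S(h): the suffix in S that matches the end of the observations."""
    n = len(observations)
    for s in suffixes:
        if len(s) <= n and tuple(observations[n - len(s):]) == s:
            return s
    return None  # history still shorter than every candidate context


def phi_improve(suffixes, alphabet, cost, rng=random):
    """One randomized split/merge step on a complete suffix-free set S."""
    s = rng.choice(sorted(suffixes))
    p, q = rng.random(), rng.random()
    proposal = set(suffixes)
    if p > 0.5:
        # Split: replace s by its extensions os (one older observation).
        proposal.remove(s)
        proposal.update((o,) + s for o in alphabet)
    elif s:
        # Merge (assumption of this sketch): s and its siblings -> parent.
        parent = s[1:]
        siblings = {(o,) + parent for o in alphabet}
        if siblings <= proposal:
            proposal -= siblings
            proposal.add(parent)
    # Accept every improvement; accept a cost increase d with prob. 2^(-d).
    log_q = math.log2(q) if q > 0 else -math.inf
    if cost(suffixes) - cost(proposal) > log_q:
        return proposal
    return set(suffixes)
```

Starting from the trivial set `{()}` and repeatedly calling `phi_improve` keeps the set complete and suffix free, because both moves exchange a context for all of its one-step extensions or vice versa.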

6 Exploration & Exploitation

Having obtained a good estimate $\hat\Phi$ of $\Phi^{best}$ in the previous
section, we can/must now determine a good action for
our agent. For a finite MDP with known transition probabilities,
finding the optimal action is routine. For estimated
probabilities we run into the infamous exploration-exploitation
problem, for which promising approximate solutions
have recently been suggested [SL08]. At the end of
this section I present the overall algorithm for our ΦMDP
agent.

Optimal actions for known MDPs. For a known finite
MDP $(\mathcal{S},\mathcal{A},T,R,\gamma)$, the maximal achievable ("optimal")
expected future discounted reward sum, called (Q) Value
(of action $a$) in state $s$, satisfies the following (Bellman)
equations [SB98]

$$Q_s^{*a} = \sum_{s'} T_{ss'}^a \left[R_{ss'}^a + \gamma V_{s'}^*\right] \quad\text{and}\quad V_s^* = \max_a Q_s^{*a} \tag{6}$$

where $0 < \gamma < 1$ is a discount parameter, typically close to
1. See [Hut05, Sec. 5.7] for proper choices. The equations
can be solved in polynomial time by a simple iteration
process or various other methods [Put94]. After observing
$o_{n+1}$, the optimal next action is

$$a_{n+1} := \arg\max_a Q_{s_{n+1}}^{*a}, \quad\text{where } s_{n+1} = \Phi(h_{n+1}) \tag{7}$$
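
The "simple iteration process" can be sketched as plain value iteration on (6), followed by the greedy choice (7). The nested-list layout `T[a][s][s2]`, `R[a][s][s2]`, the tolerance, and the function names are assumptions of this sketch; the article does not prescribe a particular solver.

```python
def value_iteration(T, R, gamma, tol=1e-10, max_iter=100_000):
    """Iterate Eq. (6); T[a][s][s2] and R[a][s][s2] are nested lists.

    Returns (Q, V) with Q[a][s] the action values and V[s] the state values.
    """
    n_actions, n_states = len(T), len(T[0])
    V = [0.0] * n_states
    Q = [[0.0] * n_states for _ in range(n_actions)]
    for _ in range(max_iter):
        Q = [[sum(T[a][s][s2] * (R[a][s][s2] + gamma * V[s2])
                  for s2 in range(n_states))
              for s in range(n_states)]
             for a in range(n_actions)]
        V_new = [max(Q[a][s] for a in range(n_actions))
                 for s in range(n_states)]
        converged = max(abs(x - y) for x, y in zip(V, V_new)) < tol
        V = V_new
        if converged:
            break
    return Q, V


def greedy_action(Q, s):
    """Eq. (7): an action maximizing Q in state s."""
    return max(range(len(Q)), key=lambda a: Q[a][s])
```

As a usage example, take two states where action 0 stays, action 1 switches, and arriving in state 1 pays reward 1. With $\gamma = 1/2$ both states have value 2, and the greedy actions are "switch" in state 0 and "stay" in state 1.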



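Estimating the MDP. We can estimate the transition
probability $T$ by

$$\hat T_{ss'}^a := \frac{n_{ss'}^{a+}}{n_{s+}^{a+}} \quad\text{if } n_{s+}^{a+} > 0 \text{ and 0 else.} \tag{8}$$

It is easy to see that the Shannon-Fano code of $s_{1:n}$ based
on $P_{\hat T}(s_{1:n}|a_{1:n}) = \prod_{t=1}^n \hat T_{s_{t-1}s_t}^{a_{t-1}}$ plus the code of the non-zero
transition probabilities $\hat T_{ss'}^a > 0$ to relevant accuracy
$O(1/\sqrt{n_{s+}^{a+}})$ has length (2), i.e. the frequency estimate (8)
is consistent with the attributed code length. The expected
reward can be estimated as

$$\hat R_{ss'}^a := \sum_{r'\in\mathcal{R}} \hat R_{ss'}^{ar'}\, r', \qquad \hat R_{ss'}^{ar'} := \frac{n_{ss'}^{ar'}}{n_{ss'}^{a+}} \tag{9}$$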
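
A minimal sketch of these frequency estimates, assuming aligned Python lists in which `rewards[t]` is the reward received on arriving in `states[t]`. The function name `estimate_mdp` and the dictionary representation keyed by `(s, a, s')` are illustrative choices; transitions that never occurred are simply absent and are to be read as 0.

```python
from collections import Counter


def estimate_mdp(states, actions, rewards):
    """Frequency estimates of Eqs. (8) and (9) from s_t, a_t, r_t."""
    n_sas = Counter()        # n_{ss'}^{a+}: how often s --a--> s' occurred
    n_sa = Counter()         # n_{s+}^{a+}: how often a was taken in s
    reward_sum = Counter()   # sum over r' of r' * n_{ss'}^{ar'}
    for t in range(1, len(states)):
        key = (states[t - 1], actions[t - 1], states[t])
        n_sas[key] += 1
        n_sa[key[:2]] += 1
        reward_sum[key] += rewards[t]
    T_hat = {k: c / n_sa[k[:2]] for k, c in n_sas.items()}     # Eq. (8)
    R_hat = {k: reward_sum[k] / c for k, c in n_sas.items()}   # Eq. (9)
    return T_hat, R_hat
```

For the state sequence x, y, x, y, x, x under a single action, the estimate gives probability 2/3 for the transition from x to y and 1/3 for staying in x.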



Exploration. Simply replacing $T$ and $R$ in (6) and (7) by
their estimates (8) and (9) can lead to very poor behavior,
since parts of the state space may never be explored,
causing the estimates to stay poor.

Estimate $\hat T$ improves with increasing $n_{s+}^{a+}$, which can
(only) be ensured by trying all actions $a$ in all states $s$
sufficiently often. But the greedy policy above has no
incentive to explore, which may cause the agent to perform
very poorly: The agent stays with what he believes
to be optimal without trying to solidify his belief. Trading
off exploration versus exploitation optimally is computationally
intractable [Hut05, PVHR06, RP08] in all but
extremely simple cases (e.g. Bandits). Recently, polynomially
optimal algorithms (Rmax, E3, OIM) have been invented
[KS98, SL08]: An agent is more explorative if he
expects a high reward in the unexplored regions. We can
"deceive" the agent to believe this by adding another "absorbing"
high-reward state $s^e$ to $\mathcal{S}$, not in the range of
$\Phi(h)$, i.e. never observed. Henceforth, $\mathcal{S}$ denotes the extended
state space. For instance + in (8) now includes $s^e$.
We set

$$n_{ss^e}^a = 1, \qquad n_{s^e s}^a = \delta_{s^e s}, \qquad R_{ss^e}^a = R_{max}^e \tag{10}$$

for all $s,a$, where exploration bonus $R_{max}^e$ is polynomially
(in $(1-\gamma)^{-1}$ and $|\mathcal{S}\times\mathcal{A}|$) larger than $\max\mathcal{R}$ [SL08].

Now compute the agent's action by (6)-(9) but for the
extended $\mathcal{S}$. The optimal policy $p^*$ tries to find a chain
of actions and states that likely leads to the high reward
absorbing state $s^e$. Transition $\hat T_{ss^e}^a = 1/n_{s+}^{a+}$ is only "large"
for small $n_{s+}^{a+}$, hence $p^*$ has a bias towards unexplored
(state, action) regions. It can be shown that this algorithm
makes only a polynomial number of sub-optimal actions.
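
The extension (10) amounts to adding one pseudo-count per (state, action) pair. The sketch below assumes counts and reward estimates stored in dictionaries keyed by `(s, a, s')`; the function name and the label `"s_e"` for the extra state are hypothetical. It returns the transition estimate (8) on the extended state space together with the extended rewards.

```python
def extend_with_exploration_state(n_sas, states, actions, r_hat, r_e_max,
                                  s_e="s_e"):
    """Apply Eq. (10): add an absorbing high-reward state s_e.

    n_sas maps (s, a, s') to transition counts, r_hat maps (s, a, s') to
    expected rewards. Returns (T_hat, rewards) on the extended state space.
    """
    counts = dict(n_sas)
    rewards = dict(r_hat)
    for a in actions:
        for s in list(states) + [s_e]:
            counts[(s, a, s_e)] = 1           # n^a_{s s_e} = 1 (s_e absorbing)
            rewards[(s, a, s_e)] = r_e_max    # R^a_{s s_e} = R^e_max
    totals = {}
    for (s, a, _), c in counts.items():       # n_{s+}^{a+} now includes s_e
        totals[(s, a)] = totals.get((s, a), 0) + c
    T_hat = {k: c / totals[k[:2]] for k, c in counts.items()}
    return T_hat, rewards
```

A pair (s, a) that has been tried three times reaches `s_e` with estimated probability 1/4, whereas a pair that has never been tried leads to `s_e` with probability 1, which is what pulls the planner towards unexplored regions.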

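The overall algorithm for our ΦMDP agent is as follows.

ΦMDP-Agent($\mathcal{A}$, $\mathcal{R}$)
⌈ Initialize $\Phi \equiv \epsilon$; $\mathcal{S} = \{\epsilon\}$; $h_0 = a_0 = r_0 = \epsilon$;
│ for $n = 0,1,2,3,...$
│ ⌈ Choose e.g. $\gamma = 1 - 1/(n+1)$;
│ │ Set $R_{max}^e = \mathrm{Polynomial}((1-\gamma)^{-1}, |\mathcal{S}\times\mathcal{A}|)\cdot\max\mathcal{R}$;
│ │ While waiting for $o_{n+1}$ {$\Phi := $ ΦImprove($\Phi$, $h_n$)};
│ │ Observe $o_{n+1}$; $h_{n+1} = h_n a_n r_n o_{n+1}$;
│ │ $s_{n+1} := \Phi(h_{n+1})$; $\mathcal{S} := \mathcal{S}\cup\{s_{n+1}\}$;
│ │ Compute action $a_{n+1}$ from Equations (6)-(10);
│ │ Output action $a_{n+1}$;
⌊ ⌊ Observe reward $r_{n+1}$;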

7 Improved Cost Function 

As discussed, we ultimately only care about (modeling) 
the rewards, but this endeavor required introducing and 
coding states. The resulting $\mathrm{Cost}(\Phi|h)$ function is a code
length of not only the rewards but also the "spurious"
states. This likely leads to too strong a penalty on models
$\Phi$ with large state spaces $\mathcal{S}$. The proper Bayesian formulation
developed in this section allows us to "integrate" out
the states. This leads to a code for the rewards only, which 
better trades off accuracy of the reward model and state 
space size. 

For an MDP with transition and reward probabilities
$T_{ss'}^a$ and $R_{ss'}^{ar'}$, the probabilities of the state and reward
sequences are

$$P(s_{1:n}|a_{1:n}) = \prod_{t=1}^n T_{s_{t-1}s_t}^{a_{t-1}}, \qquad P(r_{1:n}|s_{1:n}a_{1:n}) = \prod_{t=1}^n R_{s_{t-1}s_t}^{a_{t-1}r_t}$$

The probability of $r|a$ can be obtained by taking the product
and marginalizing $s$:

$$P_U(r_{1:n}|a_{1:n}) = \sum_{s_{1:n}} \prod_{t=1}^n U_{s_{t-1}s_t}^{a_{t-1}r_t} = \sum_{s_n} \left[U^{a_0 r_1} \cdots U^{a_{n-1} r_n}\right]_{s_0 s_n}$$

where for each $a \in \mathcal{A}$ and $r' \in \mathcal{R}$, matrix $U^{ar'} \in \mathbb{R}^{m\times m}$ is
defined as $[U^{ar'}]_{ss'} \equiv U_{ss'}^{ar'} := T_{ss'}^a R_{ss'}^{ar'}$. The right $n$-fold
matrix product can be evaluated in time $O(m^2 n)$. This
shows that $r$ given $a$ and $U$ can be coded in $-\log P_U$ bits.
The unknown $U$ needs to be estimated, e.g. by the relative
frequency $\hat U_{ss'}^{ar'} := n_{ss'}^{ar'}/n_{s+}^{a+}$. The $M := m(m-1)|\mathcal{A}|(|\mathcal{R}|-1)$
(independent) elements of $U$ can be coded to sufficient
accuracy in $\frac12 M\log n$ bits. Together this leads to a code
for $r|a$ of length

$$\mathrm{ICost}(\Phi|h_n) := -\log P_{\hat U}(r_{1:n}|a_{1:n}) + \tfrac12 M\log n \tag{11}$$

In practice, $M$ can and should be chosen smaller, as done
in the original Cost function, where we have used a restrictive
model for $R$ and considered only non-zero transitions
in $T$.
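
The $O(m^2 n)$ evaluation can be sketched as a forward recursion that multiplies a row vector through the matrices one step at a time. In this sketch, `U[(a, r)]` is an $m\times m$ nested list that must be present for every occurring pair, `actions[t]` and `rewards[t]` stand for $a_{t-1}$ and $r_t$, and `s0` is the index of the start state; these conventions, the function names, and the rescaling against numerical underflow are assumptions and not part of the article.

```python
import math


def neg_log_p_u(U, actions, rewards, s0):
    """-log2 P_U(r_{1:n} | a_{1:n}) by a forward pass in O(m^2 n) time."""
    m = len(next(iter(U.values())))
    alpha = [1.0 if s == s0 else 0.0 for s in range(m)]   # unit row vector
    bits = 0.0
    for a, r in zip(actions, rewards):
        M = U[(a, r)]
        alpha = [sum(alpha[s] * M[s][s2] for s in range(m))
                 for s2 in range(m)]
        total = sum(alpha)
        if total == 0.0:
            return math.inf            # reward sequence impossible under U
        bits -= math.log2(total)
        alpha = [x / total for x in alpha]   # rescale to avoid underflow
    return bits


def icost(U, actions, rewards, s0, n_actions, n_rewards):
    """ICost of Eq. (11) with M = m(m-1)|A|(|R|-1) coded parameters."""
    m = len(next(iter(U.values())))
    M = m * (m - 1) * n_actions * (n_rewards - 1)
    return (neg_log_p_u(U, actions, rewards, s0)
            + 0.5 * M * math.log2(len(rewards)))
```

For a two-state chain with uniform transitions in which the reward equals the index of the state just entered, every reward costs exactly one bit, so four rewards give 4 bits plus the parameter penalty $\frac12\cdot 2\cdot\log 4 = 2$ bits.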

8 Conclusion 

I have developed a formal criterion for evaluating and selecting
good "feature" maps $\Phi$ from histories to states
and presented the feature reinforcement learning algorithm
ΦMDP-Agent(). The computational flow is $h \leadsto
\hat\Phi \leadsto (\hat T,\hat R) \leadsto (\hat V,\hat Q) \leadsto a$. The algorithm can easily and
significantly be accelerated: Local search algorithms produce
sequences of "similar" $\Phi$, which naturally suggests
computing/updating $\mathrm{Cost}(\Phi|h)$ and the value function $V$
incrementally. The primary purpose of this work was to
introduce and explore $\Phi$-selection on the conveniently simple
(but impractical) unstructured finite MDPs. The results
of this work set the stage for the more powerful ΦDBN
model developed in the companion article [Hut09] based
on Dynamic Bayesian Networks. The major open problems
are to develop smart $\Phi$ generation and smart stochastic
search algorithms for $\Phi^{best}$, and to determine whether
minimizing (11) is the right criterion.